Splits for HSCDataSet #105
Conversation
**mtauraso** commented on Oct 23, 2024
- HSCDataSet has been significantly modified to implement splits
- Constructors for fibad_data_sets now take a split argument
- Tests pass, but new functionality not yet tested
- New configs added in new "prepare" section to define splits
- Some configs in "model" reorganized to a "train" section that is similar to the "prepare" "predict" and "download" sections in that it configures the action more than anything else.
- Added "split" config to train and predict sections to select the split that will be used.
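The split plumbing described in the PR summary above can be sketched roughly as follows. This is a hedged illustration only: the class name `SimpleSplitDataSet` and the config keys are hypothetical stand-ins, not the actual fibad API.

```python
# Hypothetical sketch: a data set whose constructor takes a `split`
# argument and selects the matching subset, with split sizes defined
# by "prepare"-style config values. Not the actual fibad implementation.
import random
from typing import Optional

class SimpleSplitDataSet:
    def __init__(self, config: dict, split: Optional[str] = None):
        n = config["num_samples"]
        indexes = list(range(n))
        random.Random(config.get("seed", 0)).shuffle(indexes)
        n_test = int(config["test_size"] * n)
        n_validate = int(config["validate_size"] * n)
        self.splits = {
            "test": indexes[:n_test],
            "validate": indexes[n_test:n_test + n_validate],
            "train": indexes[n_test + n_validate:],
        }
        # split=None selects the entire data set.
        self.current_split = self.splits[split] if split else indexes

    def __len__(self):
        return len(self.current_split)

config = {"num_samples": 100, "test_size": 0.2, "validate_size": 0.2}
print(len(SimpleSplitDataSet(config, split="train")))  # 60
```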
**Codecov Report.** Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff            @@
##             main     #105     +/-  ##
=========================================
+ Coverage   34.83%   41.10%   +6.26%
=========================================
  Files          18       18
  Lines         887     1017     +130
=========================================
+ Hits          309      418     +109
- Misses        578      599      +21
```

View full report in Codecov by Sentry.
src/fibad/data_sets/hsc_data_set.py (Outdated)

```python
def __init__(self, config, split: Union[str, None] = None):
```
The default value of `split=None` here is to get some tests to pass short-term. The intent is that the function signature for any `@fibad_data_set` `__init__()` will require a split to be provided, or that you explicitly pass `None` because it was in a config.
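That intended contract could look something like the sketch below: make `split` a required parameter with no default, so forgetting it is an error and passing `None` must be a deliberate choice. The class name here is hypothetical; only the decorator name `@fibad_data_set` comes from the PR.

```python
# Sketch of the intended signature: no default for `split`, so callers
# must either name a split or explicitly pass None from config.
# ExampleDataSet is an illustrative name, not fibad code.
from typing import Union

class ExampleDataSet:
    def __init__(self, config: dict, split: Union[str, None]):
        # Omitting the argument raises TypeError at call time,
        # so None is always an explicit opt-out.
        self.split = split

try:
    ExampleDataSet({})
except TypeError:
    print("split is required")  # → split is required
```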
```python
    return len(self.current_split)

class HSCDataSetSplit(Dataset):
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class had more functionality in the first draft, but it now looks a lot like a glorified numpy masked array. I ended up writing it this way to ensure there's only ever one `HSCDatasetContainer`. I expect that object to carry bookkeeping for ~10M files at runtime, and I essentially never want a copy made of it. If anyone has suggestions for preserving object lifetimes while rewriting this to be shorter with `numpy.ma`, I'm interested.
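The "glorified masked array" idea can be shown in a few lines: the split holds only a boolean mask plus a reference to one shared container, so the large bookkeeping object is never copied. This is an illustrative sketch (class and attribute names are invented), not the PR's `HSCDataSetSplit` code.

```python
# Minimal sketch: a split is a mask over a single shared container.
# The container is referenced, never copied, so there is only ever
# one copy of the large bookkeeping object.
class MaskedSplit:
    def __init__(self, container, mask):
        self.container = container  # shared reference
        self.mask = mask            # True = member of this split
        self.indexes = [i for i, m in enumerate(mask) if m]

    def __len__(self):
        return len(self.indexes)

    def __getitem__(self, idx):
        # Translate a split-local index into a container index.
        return self.container[self.indexes[idx]]

data = ["a", "b", "c", "d", "e"]
train = MaskedSplit(data, [True, True, False, True, False])
print(len(train), train[2])  # 3 d
```

A real `numpy.ma` version would replace the index list with a masked view, but the lifetime guarantee (one shared container) is the same.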
Overall I think this looks good for now. The amount of extra code in the hsc_data_set module makes me a little nervous about future users needing to implement similar code for splits, but I also think that is something we can address down the line.
And I can work with Max to figure out how to adjust the kbmod-ml dataset module to work within this new paradigm.
src/fibad/fibad_default_config.toml (Outdated)

```toml
# If `int`, represents the absolute number of test samples.
# If `false`, the value is set to the complement of the train size.
# If `train_size` is also `false`, it will be set to `0.25`.
test_size = 0.6
```
This is a trivial comment, but I would expect `test` to be the last in the list. In my mind at least, I always think in the order train, validate, test. Also, perhaps this is ignorance talking, but wouldn't typical values for these be:

- test_size = 0.2
- train_size = 0.6
- validation_size = 0.2
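With the reviewer's suggested values and ordering, the config fragment might read as follows. This is a hypothetical rewrite of the snippet above; the section placement and exact key names should follow whatever the actual default config uses.

```toml
# Hypothetical fragment with the suggested order and values.
train_size = 0.6
validation_size = 0.2
test_size = 0.2
```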
I'm going to fix this, because like you say it is trivial to fix.
I'm also going to adopt your values, because I'm not sure where I got those values from 😄
This section looks fine for now. I just searched a bit for CIFAR10 validation sets, and found an inspirational gist here: https://gist.github.com/kevinzakka/d33bf8d6c7f06a9d8c76d97a7879f5cb#file-data_loader-py-L90-L109

Also, I was unaware of the `sampler` parameter that can be passed to DataLoader. I do wonder if it would be helpful, or just get in the way of using HuggingFace datasets.
This also worried me while writing it. I'm going to think about what an external dataset writer can do as we implement huggingface. Hopefully there is an easy way to essentially explain one's data to huggingface and then ask it to perform the split. I tried to fashion this after their (and scikit-learn's) split code in the hope that when we get there we're interface-compatible. That choice did make the implementation of this PR somewhat more complex than just "add splits".
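For reference, the scikit-learn `train_test_split` semantics being imitated (float sizes are fractions, int sizes are absolute counts, an unset size is the complement of the other, with a default test fraction of 0.25) can be sketched in plain Python. This is an illustration of those rules, not the PR's code and not scikit-learn itself.

```python
# Sketch of train_test_split-style size resolution: floats are
# fractions of n, ints are absolute counts, None is the complement
# of the other argument (defaulting to a 0.25 test fraction).
import random

def resolve_sizes(n, train_size=None, test_size=None):
    if test_size is None and train_size is None:
        test_size = 0.25
    n_test = int(n * test_size) if isinstance(test_size, float) else test_size
    if n_test is None:
        n_train = int(n * train_size) if isinstance(train_size, float) else train_size
        n_test = n - n_train
    if train_size is None:
        n_train = n - n_test
    else:
        n_train = int(n * train_size) if isinstance(train_size, float) else train_size
    return n_train, n_test

def train_test_split_indexes(n, train_size=None, test_size=None, seed=0):
    n_train, n_test = resolve_sizes(n, train_size, test_size)
    indexes = list(range(n))
    random.Random(seed).shuffle(indexes)
    return indexes[:n_train], indexes[n_train:n_train + n_test]

train, test = train_test_split_indexes(100, test_size=0.2)
print(len(train), len(test))  # 80 20
```

Staying interface-compatible with these semantics means a later swap to huggingface's or scikit-learn's own split machinery should not change any config files.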
Force-pushed from 815cc82 to 352481a

Force-pushed from 352481a to f173f9b